Root Mean Squared Error (RMSE) — Regression Metric (From Scratch)#
RMSE measures the typical size of prediction errors in the same units as the target. It is the square root of the mean squared error (MSE), so it penalizes large errors more heavily than MAE does.
Goals
Build intuition with small numeric examples + Plotly visuals
Derive the formula (and gradients) with clear notation
Implement RMSE in NumPy (from scratch) and validate vs scikit-learn
Use RMSE/MSE to fit a simple linear regression with gradient descent
Summarize pros/cons, good use cases, and common pitfalls
Quick import#
from sklearn.metrics import root_mean_squared_error
In older scikit-learn versions the equivalent was mean_squared_error(..., squared=False); the squared parameter has since been deprecated in favor of root_mean_squared_error.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
import plotly
import sklearn
print("numpy :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)
print("plotly :", plotly.__version__)
numpy : 1.26.2
pandas : 2.1.3
sklearn: 1.6.0
plotly : 6.5.2
Prerequisites#
Basic regression setup: true targets \(y\) and predictions \(\hat y\)
Vector norms and means
(Optional) Basic derivatives for the gradient section
1) Definition and notation#
Let:
\(y \in \mathbb{R}^n\) be the true targets
\(\hat y \in \mathbb{R}^n\) be the predictions
residuals (signed errors) \(r_i = \hat y_i - y_i\)
The mean squared error (MSE) is:

\[\mathrm{MSE}(y, \hat y) = \frac{1}{n}\sum_{i=1}^{n} r_i^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2\]

The root mean squared error (RMSE) is:

\[\mathrm{RMSE}(y, \hat y) = \sqrt{\mathrm{MSE}(y, \hat y)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2}\]
Vector form#
With residual vector \(r = \hat y - y\):

\[\mathrm{RMSE} = \frac{\lVert r \rVert_2}{\sqrt{n}}\]
Key interpretation: RMSE is a scaled Euclidean distance between the prediction vector and the target vector.
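As a quick sketch of that identity (NumPy only, with toy numbers of my own), the elementwise definition and the norm form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)
y_hat = y + rng.normal(scale=0.3, size=5)

r = y_hat - y
rmse_definition = np.sqrt(np.mean(r**2))              # sqrt of mean squared residual
rmse_norm_form = np.linalg.norm(r) / np.sqrt(len(r))  # scaled Euclidean norm

print(np.isclose(rmse_definition, rmse_norm_form))  # True
```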
Weighted RMSE#
For non-negative sample weights \(w_i\):

\[\mathrm{RMSE}_w = \sqrt{\frac{\sum_{i=1}^{n} w_i\, r_i^2}{\sum_{i=1}^{n} w_i}}\]
Multioutput targets#
For \(y, \hat y \in \mathbb{R}^{n \times d}\), compute per-output RMSE:

\[\mathrm{RMSE}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat y_{ij} - y_{ij})^2}, \qquad j = 1, \dots, d\]

and then aggregate (uniform average by default, or a weighted average over outputs).
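A small sketch of the per-output computation on made-up 2-output data (the full scikit-learn-compatible version comes in section 7):

```python
import numpy as np

# toy multioutput data: 3 samples, 2 outputs
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.1, 12.0], [1.9, 19.0], [3.2, 33.0]])

r = y_pred - y_true
rmse_per_output = np.sqrt(np.mean(r**2, axis=0))  # one RMSE per output (column)
rmse_uniform = float(rmse_per_output.mean())      # uniform average over outputs

print(rmse_per_output, rmse_uniform)
```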
2) Intuition: “average error size” with extra penalty for big misses#
If you make one error twice as large, its squared contribution becomes 4× larger.
That means RMSE:
has the same units as \(y\) (unlike MSE)
behaves like a “typical” error magnitude
is more sensitive to outliers than MAE because of the square
A common statistical view: if residuals are i.i.d. Gaussian with constant variance, minimizing MSE/RMSE is equivalent to maximum likelihood.
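A minimal numeric sketch of that equivalence for a constant predictor (the toy data and the fixed, known \(\sigma\) are my assumptions): the Gaussian negative log-likelihood is a positive multiple of MSE plus a constant, so both objectives pick the same \(c\) on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=500)
c_grid = np.linspace(0.0, 4.0, 801)

sq = (y[None, :] - c_grid[:, None]) ** 2
mse = sq.mean(axis=1)

# Gaussian NLL with fixed sigma: n/2 * log(2*pi*sigma^2) + sum(sq) / (2*sigma^2)
sigma = 1.5
nll = 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + sq.sum(axis=1) / (2 * sigma**2)

# same grid point minimizes both objectives
print(c_grid[np.argmin(mse)] == c_grid[np.argmin(nll)])  # True
```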
3) A tiny worked example#
We’ll compute RMSE step-by-step on a small set of points.
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 1.9])
residual = y_pred - y_true
abs_error = np.abs(residual)
sq_error = residual**2
df = pd.DataFrame(
{
"y_true": y_true,
"y_pred": y_pred,
"residual (y_pred - y_true)": residual,
"|residual|": abs_error,
"residual^2": sq_error,
}
)
mse = float(np.mean(sq_error))
rmse = float(np.sqrt(mse))
rmse_sklearn = root_mean_squared_error(y_true, y_pred)
df, mse, rmse, rmse_sklearn
( y_true y_pred residual (y_pred - y_true) |residual| residual^2
0 3.0 2.5 -0.5 0.5 0.25
1 -0.5 0.0 0.5 0.5 0.25
2 2.0 2.1 0.1 0.1 0.01
3 7.0 7.8 0.8 0.8 0.64
4 4.2 1.9 -2.3 2.3 5.29,
1.288,
1.1349008767288886,
1.1349008767288886)
4) How squaring changes “importance” across samples#
MAE gives each sample weight proportional to \(|r_i|\).
MSE/RMSE gives each sample weight proportional to \(r_i^2\), so large residuals dominate more.
Below we plot the fraction of total error each sample contributes under MAE vs MSE/RMSE.
idx = np.arange(len(y_true))
mae_weights = abs_error / abs_error.sum()
mse_weights = sq_error / sq_error.sum()
fig = go.Figure()
fig.add_trace(go.Bar(x=idx, y=mae_weights, name="MAE weight (|r| / sum|r|)"))
fig.add_trace(go.Bar(x=idx, y=mse_weights, name="MSE/RMSE weight (r^2 / sum r^2)"))
fig.update_layout(
barmode="group",
title="Per-sample influence under MAE vs RMSE",
xaxis_title="sample index",
yaxis_title="fraction of total error",
)
fig.show()
5) Outlier sensitivity: RMSE vs MAE#
We’ll keep all residuals fixed except one, and sweep that one residual from 0 to a large value.
Because RMSE squares residuals, it grows faster than MAE as the outlier grows.
n = 200
base_residuals = rng.normal(0.0, 1.0, size=n)
base_residuals[0] = 0.0
outlier_values = np.linspace(0, 15, 200)
rmse_vals = []
mae_vals = []
for v in outlier_values:
r = base_residuals.copy()
r[0] = v
rmse_vals.append(float(np.sqrt(np.mean(r**2))))
mae_vals.append(float(np.mean(np.abs(r))))
df_sweep = pd.DataFrame({"outlier |r|": outlier_values, "RMSE": rmse_vals, "MAE": mae_vals})
fig = px.line(
df_sweep,
x="outlier |r|",
y=["RMSE", "MAE"],
title="Single outlier sweep: RMSE grows faster than MAE",
labels={"value": "metric value (same units as y)", "variable": "metric"},
)
fig.show()
6) Key property: constant predictor → mean#
If your model can only predict a constant \(c\) for every sample (\(\hat y_i = c\)), then:

\[\mathrm{RMSE}(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (c - y_i)^2}\]

Because the square root is monotonic, the minimizer of RMSE\((c)\) is the same as the minimizer of:

\[\mathrm{MSE}(c) = \frac{1}{n}\sum_{i=1}^{n} (c - y_i)^2\]

Differentiate and set to zero:

\[\frac{d\,\mathrm{MSE}}{dc} = \frac{2}{n}\sum_{i=1}^{n} (c - y_i) = 0 \quad\Rightarrow\quad c^{*} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y\]
This is one reason RMSE is sensitive to outliers: the mean shifts.
In contrast, MAE is minimized by the median (more robust to outliers).
y_const = rng.normal(0.0, 1.0, size=60)
y_const[0] = 8.0 # add a clear outlier
c_grid = np.linspace(y_const.min() - 1, y_const.max() + 1, 400)
rmse_curve = np.sqrt(np.mean((y_const[None, :] - c_grid[:, None]) ** 2, axis=1))
mae_curve = np.mean(np.abs(y_const[None, :] - c_grid[:, None]), axis=1)
c_mean = float(np.mean(y_const))
c_median = float(np.median(y_const))
y_top = float(max(rmse_curve.max(), mae_curve.max()))
fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=rmse_curve, mode="lines", name="RMSE(c)"))
fig.add_trace(go.Scatter(x=c_grid, y=mae_curve, mode="lines", name="MAE(c)"))
fig.add_vline(x=c_mean, line_dash="dash", line_color="#1f77b4")
fig.add_vline(x=c_median, line_dash="dash", line_color="#ff7f0e")
fig.add_annotation(x=c_mean, y=y_top, text="mean", showarrow=False, yshift=10, font=dict(color="#1f77b4"))
fig.add_annotation(
x=c_median,
y=y_top,
text="median",
showarrow=False,
yshift=10,
font=dict(color="#ff7f0e"),
)
fig.update_layout(
title="Constant predictor: RMSE is minimized by the mean (MAE by the median)",
xaxis_title="constant prediction c",
yaxis_title="metric value",
)
fig.show()
7) NumPy implementation (from scratch)#
We’ll implement RMSE in a way that matches scikit-learn’s signature:
supports 1D or 2D targets ((n_samples,) or (n_samples, n_outputs))
optional sample_weight
multioutput: return per-output values (raw_values) or an average (uniform_average)
def _as_2d(y):
y = np.asarray(y, dtype=float)
if y.ndim == 1:
return y.reshape(-1, 1)
if y.ndim == 2:
return y
raise ValueError("y must be 1D or 2D (n_samples,) or (n_samples, n_outputs).")
def mse_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
"""Mean squared error with scikit-learn-like multioutput handling."""
y_true_2d = _as_2d(y_true)
y_pred_2d = _as_2d(y_pred)
if y_true_2d.shape != y_pred_2d.shape:
raise ValueError(f"shape mismatch: y_true{y_true_2d.shape} vs y_pred{y_pred_2d.shape}")
residual = y_pred_2d - y_true_2d
if sample_weight is None:
mse_per_output = np.mean(residual**2, axis=0)
else:
w = np.asarray(sample_weight, dtype=float)
if w.ndim != 1:
raise ValueError("sample_weight must be 1D of shape (n_samples,).")
if w.shape[0] != y_true_2d.shape[0]:
raise ValueError("sample_weight length must match n_samples.")
w = w.reshape(-1, 1)
mse_per_output = np.sum(w * residual**2, axis=0) / np.sum(w, axis=0)
if isinstance(multioutput, str):
if multioutput == "raw_values":
return mse_per_output
if multioutput == "uniform_average":
return float(np.mean(mse_per_output))
raise ValueError(
"multioutput must be 'raw_values', 'uniform_average', or array-like of shape (n_outputs,)."
)
weights = np.asarray(multioutput, dtype=float)
if weights.shape != (mse_per_output.shape[0],):
raise ValueError("multioutput weights must match n_outputs.")
return float(np.average(mse_per_output, weights=weights))
def rmse_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
"""Root mean squared error (RMSE): sqrt(mean((y_pred - y_true)^2))."""
mse_per_output = mse_np(
y_true,
y_pred,
sample_weight=sample_weight,
multioutput="raw_values",
)
rmse_per_output = np.sqrt(mse_per_output)
if isinstance(multioutput, str):
if multioutput == "raw_values":
return rmse_per_output
if multioutput == "uniform_average":
return float(np.mean(rmse_per_output))
raise ValueError(
"multioutput must be 'raw_values', 'uniform_average', or array-like of shape (n_outputs,)."
)
weights = np.asarray(multioutput, dtype=float)
if weights.shape != (rmse_per_output.shape[0],):
raise ValueError("multioutput weights must match n_outputs.")
return float(np.average(rmse_per_output, weights=weights))
y_true_rand = rng.normal(size=(50, 3))
y_pred_rand = y_true_rand + rng.normal(scale=0.5, size=y_true_rand.shape)
print("ours raw:", rmse_np(y_true_rand, y_pred_rand, multioutput="raw_values"))
print("sk raw:", root_mean_squared_error(y_true_rand, y_pred_rand, multioutput="raw_values"))
sample_w = rng.uniform(0.5, 2.0, size=y_true_rand.shape[0])
print("ours weighted:", rmse_np(y_true_rand, y_pred_rand, sample_weight=sample_w))
print("sk weighted:", root_mean_squared_error(y_true_rand, y_pred_rand, sample_weight=sample_w))
ours raw: [0.45323041 0.53601038 0.40367295]
sk raw: [0.45323041 0.53601038 0.40367295]
ours weighted: 0.4660120349767279
sk weighted: 0.4660120349767279
8) RMSE as an objective: gradients and optimization#
RMSE is often reported to humans because it has the same units as \(y\).
When training with gradient-based methods, people often minimize MSE instead:
RMSE and MSE have the same minimizer (square-root is monotonic)
the MSE gradient is simpler and avoids dividing by RMSE
Still, we can differentiate RMSE and optimize it directly; the next section shows both.
8.1 Gradients#
Let \(r_i = \hat y_i - y_i\) and

\[\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} r_i^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}\]

Gradient w.r.t. predictions:

\[\frac{\partial\,\mathrm{MSE}}{\partial \hat y_i} = \frac{2 r_i}{n}\]

If RMSE \(> 0\):

\[\frac{\partial\,\mathrm{RMSE}}{\partial \hat y_i} = \frac{1}{2\,\mathrm{RMSE}} \cdot \frac{\partial\,\mathrm{MSE}}{\partial \hat y_i} = \frac{r_i}{n\,\mathrm{RMSE}}\]

So (for RMSE \(> 0\)) the gradients point in the same direction; they differ only by the scalar factor \(\frac{1}{2\,\mathrm{RMSE}}\).

For a linear model \(\hat y_i = w x_i + b\), the parameter gradients are:

\[\frac{\partial\,\mathrm{MSE}}{\partial w} = \frac{2}{n}\sum_{i=1}^{n} r_i x_i, \qquad \frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n} r_i\]
(At a perfect fit where RMSE \(= 0\), the derivative is undefined; in code we add a tiny \(\varepsilon\).)
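As a quick sanity sketch of the RMSE gradient (toy residuals of my own; the Exercises section asks you to build a fuller version of this check), central finite differences should match the analytic formula:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=8)
y_hat = y + rng.normal(scale=0.5, size=8)

def rmse(pred):
    return np.sqrt(np.mean((pred - y) ** 2))

# analytic gradient from the formula above: r_i / (n * RMSE)
r = y_hat - y
grad_analytic = r / (len(y) * rmse(y_hat))

# central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    up, down = y_hat.copy(), y_hat.copy()
    up[i] += eps
    down[i] -= eps
    grad_fd[i] = (rmse(up) - rmse(down)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd))  # True
```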
n = 200
x = rng.uniform(-3, 3, size=n)
true_w, true_b = 1.8, -0.7
y = true_w * x + true_b + rng.normal(0, 0.8, size=n)
perm = rng.permutation(n)
train_size = int(0.8 * n)
train_idx = perm[:train_size]
test_idx = perm[train_size:]
x_tr, y_tr = x[train_idx], y[train_idx]
x_te, y_te = x[test_idx], y[test_idx]
x_tr.shape, x_te.shape
((160,), (40,))
df_data = pd.DataFrame(
{
"x": np.concatenate([x_tr, x_te]),
"y": np.concatenate([y_tr, y_te]),
"split": np.array(["train"] * len(x_tr) + ["test"] * len(x_te)),
}
)
fig = px.scatter(
df_data,
x="x",
y="y",
color="split",
title="Synthetic 1D regression data",
labels={"x": "feature x", "y": "target y"},
)
fig.show()
9) Using RMSE/MSE to fit linear regression (from scratch)#
Model (one feature):

\[\hat y_i = w x_i + b\]
We’ll fit \((w, b)\) with gradient descent using either:
objective = MSE
objective = RMSE
def predict_linear(x, w, b):
x = np.asarray(x, dtype=float)
return w * x + b
def fit_linear_gd(x, y, *, lr=0.05, steps=200, objective="mse"):
"""Fit y ≈ w x + b with gradient descent (objective: 'mse' or 'rmse')."""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
w = 0.0
b = 0.0
n = x.shape[0]
hist = {"mse": [], "rmse": [], "w": [], "b": []}
for _ in range(steps):
y_hat = predict_linear(x, w, b)
r = y_hat - y
mse = float(np.mean(r**2))
rmse = float(np.sqrt(mse))
if objective == "mse":
grad_w = (2.0 / n) * float(np.dot(r, x))
grad_b = (2.0 / n) * float(np.sum(r))
elif objective == "rmse":
denom = max(rmse, 1e-12)
grad_w = (1.0 / (n * denom)) * float(np.dot(r, x))
grad_b = (1.0 / (n * denom)) * float(np.sum(r))
else:
raise ValueError("objective must be 'mse' or 'rmse'.")
w -= lr * grad_w
b -= lr * grad_b
hist["mse"].append(mse)
hist["rmse"].append(rmse)
hist["w"].append(w)
hist["b"].append(b)
return w, b, hist
w_mse, b_mse, hist_mse = fit_linear_gd(x_tr, y_tr, lr=0.05, steps=200, objective="mse")
w_rmse, b_rmse, hist_rmse = fit_linear_gd(x_tr, y_tr, lr=0.05, steps=200, objective="rmse")
results = pd.DataFrame(
{
"objective": ["mse", "rmse"],
"w": [w_mse, w_rmse],
"b": [b_mse, b_rmse],
"train_rmse": [
rmse_np(y_tr, predict_linear(x_tr, w_mse, b_mse)),
rmse_np(y_tr, predict_linear(x_tr, w_rmse, b_rmse)),
],
"test_rmse": [
rmse_np(y_te, predict_linear(x_te, w_mse, b_mse)),
rmse_np(y_te, predict_linear(x_te, w_rmse, b_rmse)),
],
}
)
results
|   | objective | w | b | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 0 | mse | 1.846541 | -0.591784 | 0.749457 | 0.674895 |
| 1 | rmse | 1.846541 | -0.591783 | 0.749457 | 0.674896 |
df_hist = pd.DataFrame(
{
"iteration": np.arange(1, len(hist_mse["rmse"]) + 1),
"rmse (objective=mse)": hist_mse["rmse"],
"rmse (objective=rmse)": hist_rmse["rmse"],
}
)
fig = px.line(
df_hist,
x="iteration",
y=["rmse (objective=mse)", "rmse (objective=rmse)"],
title="Training curve (RMSE on the training set)",
labels={"value": "RMSE", "variable": "training objective"},
)
fig.show()
x_line = np.linspace(x.min(), x.max(), 200)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_tr, y=y_tr, mode="markers", name="train"))
fig.add_trace(go.Scatter(x=x_te, y=y_te, mode="markers", name="test"))
fig.add_trace(
go.Scatter(
x=x_line,
y=predict_linear(x_line, w_mse, b_mse),
mode="lines",
name="fit (objective=mse)",
)
)
fig.add_trace(
go.Scatter(
x=x_line,
y=predict_linear(x_line, w_rmse, b_rmse),
mode="lines",
name="fit (objective=rmse)",
line=dict(dash="dash"),
)
)
fig.update_layout(
title="Linear regression fits (same optimum, different objective scaling)",
xaxis_title="x",
yaxis_title="y",
)
fig.show()
10) Sanity check: closed-form least squares and scikit-learn#
For linear regression with squared error, the optimum has a closed-form (least squares) solution.
We’ll compare:
gradient descent (above)
NumPy least squares (np.linalg.lstsq)
sklearn.linear_model.LinearRegression
A = np.column_stack([x_tr, np.ones_like(x_tr)])
w_ls, b_ls = np.linalg.lstsq(A, y_tr, rcond=None)[0]
rmse_train_ls = rmse_np(y_tr, predict_linear(x_tr, w_ls, b_ls))
rmse_test_ls = rmse_np(y_te, predict_linear(x_te, w_ls, b_ls))
pd.DataFrame(
{
"method": ["least_squares"],
"w": [w_ls],
"b": [b_ls],
"train_rmse": [rmse_train_ls],
"test_rmse": [rmse_test_ls],
}
)
|   | method | w | b | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 0 | least_squares | 1.846541 | -0.591784 | 0.749457 | 0.674895 |
sk_model = LinearRegression().fit(x_tr.reshape(-1, 1), y_tr)
w_sk = float(sk_model.coef_[0])
b_sk = float(sk_model.intercept_)
rmse_train_sk = root_mean_squared_error(y_tr, sk_model.predict(x_tr.reshape(-1, 1)))
rmse_test_sk = root_mean_squared_error(y_te, sk_model.predict(x_te.reshape(-1, 1)))
pd.DataFrame(
{
"method": ["sklearn"],
"w": [w_sk],
"b": [b_sk],
"train_rmse": [rmse_train_sk],
"test_rmse": [rmse_test_sk],
}
)
|   | method | w | b | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 0 | sklearn | 1.846541 | -0.591784 | 0.749457 | 0.674895 |
11) Practical usage notes (scikit-learn)#
For reporting: root_mean_squared_error(y_true, y_pred).
For cross-validation: scorers are usually higher-is-better, so scikit-learn uses "neg_root_mean_squared_error".
For multi-output regression: set multioutput="raw_values" to get one RMSE per output.
For weighted datasets: pass sample_weight.
y_true_2 = np.column_stack([y_true, 2 * y_true])
y_pred_2 = np.column_stack([y_pred, 2 * y_pred + 0.2])
print("ours raw:", rmse_np(y_true_2, y_pred_2, multioutput="raw_values"))
print("sk raw:", root_mean_squared_error(y_true_2, y_pred_2, multioutput="raw_values"))
output_weights = np.array([0.25, 0.75])
print("ours weighted outputs:", rmse_np(y_true_2, y_pred_2, multioutput=output_weights))
print("sk weighted outputs:", root_mean_squared_error(y_true_2, y_pred_2, multioutput=output_weights))
sample_w = np.linspace(1.0, 2.0, len(y_true))
print("ours sample_weight:", rmse_np(y_true, y_pred, sample_weight=sample_w))
print("sk sample_weight:", root_mean_squared_error(y_true, y_pred, sample_weight=sample_w))
ours raw: [1.13490088 2.22890107]
sk raw: [1.13490088 2.22890107]
ours weighted outputs: 1.955401025072826
sk weighted outputs: 1.955401025072826
ours sample_weight: 1.2794530081249567
sk sample_weight: 1.2794530081249567
12) Pros, cons, and when to use RMSE#
Pros
Same units as the target → easy to interpret
Differentiable wherever RMSE \(> 0\) (unlike MAE, which is not differentiable at zero residual) and convex for linear models
Strongly penalizes large errors → good when big misses are especially costly
Closely tied to least squares / Gaussian noise assumptions
Cons
Sensitive to outliers and heavy-tailed noise
Scale-dependent: RMSE values are not comparable across targets with different units/scales
Can hide systematic bias unless you also inspect residuals (RMSE is a single number)
Good default when
You have a regression problem and care more about large errors than small ones
Residuals are roughly symmetric and not extremely heavy-tailed
13) Common pitfalls and diagnostics#
Always plot residuals: a low RMSE can still mask patterns (non-linearity, heteroscedasticity).
Outliers dominate: if your data has rare but huge errors, consider MAE, Huber loss, or quantile losses.
Scale matters: if you need comparability, report a normalized RMSE (e.g., divide by target std or range).
Skewed targets: if errors are multiplicative (percentage-like), consider RMSLE or modeling on a log scale.
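As a sketch of the log-scale option (my own helper, mirroring the log1p convention that scikit-learn's mean_squared_log_error uses; assumes non-negative values):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error; assumes non-negative targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# The same 2x multiplicative miss costs roughly the same at very different scales
# (exactly the same in the large-value limit, where log1p behaves like log):
print(rmsle([10.0], [20.0]), rmsle([1000.0], [2000.0]))
```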
y_hat_te = predict_linear(x_te, w_ls, b_ls)
resid_te = y_hat_te - y_te
fig = px.scatter(
x=x_te,
y=resid_te,
title="Residuals on test set (least squares fit)",
labels={"x": "x", "y": "residual (y_pred - y_true)"},
)
fig.add_hline(y=0, line_dash="dash")
fig.show()
fig = px.histogram(
x=resid_te,
nbins=25,
title="Residual distribution on test set",
labels={"x": "residual (y_pred - y_true)", "count": "count"},
)
fig.show()
Exercises#
Implement normalized RMSE: divide RMSE by (y_true.max() - y_true.min()) or y_true.std().
Add a single extreme outlier to the regression dataset and compare the fitted line under RMSE/MSE vs MAE.
Implement a finite-difference gradient check for the RMSE gradient.
References#
scikit-learn metric: https://scikit-learn.org/stable/api/sklearn.metrics.html
scikit-learn User Guide (mean squared error): https://scikit-learn.org/stable/modules/model_evaluation.html
The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — least squares and regression basics